Baseline

This file establishes the first and second baselines that will later be improved upon using deep learning models, and also answers the questions below.

The code below installs the required packages and imports the libraries that will be used later in this project.

Assembling a Dataset

In this project, we will be using the full GoEmotions dataset from Kaggle to perform our feature extraction. As the dataset is split into 3 tab-separated values (tsv) files, we will concatenate them together and re-index the dataframe to create the full dataset containing all text data and emotion labels.
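The concatenation step can be sketched as follows. The in-memory TSV snippets, column names, and label IDs here are illustrative stand-ins; in the real project each split would be read from its own file on disk with `pd.read_csv(..., sep="\t")`.

```python
import io
import pandas as pd

# Stand-ins for the three GoEmotions TSV splits; real code would read
# the actual files, e.g. pd.read_csv("train.tsv", sep="\t", ...).
tsv_a = io.StringIO("I love this\t17\teb1\nSo annoying\t2\teb2\n")
tsv_b = io.StringIO("What a surprise\t26\teb3\n")

cols = ["Text", "Emotion IDs", "ID"]  # column names are illustrative
parts = [pd.read_csv(f, sep="\t", header=None, names=cols) for f in (tsv_a, tsv_b)]

# Concatenate the splits and re-index so row labels run 0..n-1 over the full set
df = pd.concat(parts, ignore_index=True)
print(len(df), list(df.index))  # → 3 [0, 1, 2]
```

`ignore_index=True` performs the re-indexing in one step; the same effect can be had with `pd.concat(parts).reset_index(drop=True)`.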

From the dataset, we will predict whether a sample text is abusive, which allows us to perform content filtering. Among the emotion labels present, there are both positive and negative emotions. The negative emotions can be deemed abusive while the positive emotions can be deemed non-abusive, because texts with negative emotions tend towards negative connotations, making them better suited to the abusive category.

We need to perform emotion mapping to show the emotions present in each sample of text. This is done through the use of "ekman_mapping" and the "emotion_list", which contains all 28 emotion labels including the neutral emotion.

The emotions contained within each sample text are shown in Emotions. At this stage, there is still a total of 28 emotion labels including neutral.

After mapping is performed, we are left with only 6 major emotion labels plus neutral, namely anger, disgust, fear, joy, sadness and surprise.

Following which, we perform one-hot encoding on the "Mapped Emotions" column to display the emotions of each text and to enable classification.
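The one-hot encoding step can be sketched with scikit-learn's `MultiLabelBinarizer`; the toy "Mapped Emotions" column below stands in for the real mapped data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy "Mapped Emotions" column: each row holds the list of Ekman
# emotions detected for that text (values here are illustrative).
df = pd.DataFrame({"Mapped Emotions": [["joy"], ["anger", "disgust"], ["surprise"]]})

# One column per emotion, 1 if the emotion is present in that row
mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(
    mlb.fit_transform(df["Mapped Emotions"]),
    columns=mlb.classes_,
    index=df.index,
)
print(onehot)
```

Each row of `onehot` can hold several 1s, which is what makes the downstream task multi-label rather than multi-class.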

Data Exploration

It is difficult to infer from the neutral emotion whether a text is abusive or non-abusive. Hence, we will exclude this class and its related data points from the project.

We can see the distribution of classes and how many texts belong to a single class or to more than one class (multi-labelled).

Data Cleaning

The dataset does not contain duplicated data, hence we will only be performing textual data cleaning.

The functions below perform the Natural Language Processing pre-processing step: cleaning the data so that its more crucial attributes are highlighted for machine learning.
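A minimal cleaning function along these lines is sketched below. The exact steps in the project (stop-word list, emoji handling, lemmatisation, etc.) may differ; the stop-word set here is a small illustrative subset.

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "to", "and"}  # illustrative subset only

def clean_text(text: str) -> str:
    """Lowercase, strip URLs and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)                            # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("The game is SO unfair!!!"))  # → game so unfair
```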

We can see the number of emojis found in the entire dataset, along with each emoji's type.

Evaluation Methodology

The diagram above shows that the 26 classes have different counts, i.e. the classes are imbalanced. Hence, we need to use evaluation metrics that remain valid under class imbalance in order to obtain an accurate picture of classification performance.

In order to show that our classifier works, we will be using various metrics.
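The toy example below illustrates why imbalance-aware metrics matter: a classifier that always predicts the majority class scores 80% raw accuracy here, yet balanced accuracy (the mean of per-class recalls) exposes it. The labels are illustrative values, not project data.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1]   # toy labels: heavily imbalanced
y_pred = [0, 0, 0, 0, 0]   # a classifier that always predicts class 0

# Balanced accuracy averages recall over classes: (1.0 + 0.0) / 2 = 0.5
print(balanced_accuracy_score(y_true, y_pred))      # → 0.5
# Macro F1 likewise punishes the ignored minority class
print(f1_score(y_true, y_pred, average="macro"))
```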

Baseline performance

Random Guess/Naive Model

We can estimate the probability of correctly classifying a text without the use of any model; this is referred to as a random guess or naive model, which can be performed by any individual. We will consider a baseline that does not reference the dataset at all but simply looks at the labels/classes present, in this case "anger, disgust, fear, joy, sadness, surprise". As there are a total of 6 labels, from classes 0 to 5, the probability of a correct guess is at best 100%/6 = 16.67% (2 decimal places).

Another guess would be at the lower probability of 2.78% when taking into consideration the multi-label nature of the dataset, where a sample text can belong to 2 different classes. This is obtained from 1/36 * 100 = 2.78% (2 decimal places).

Abusive VS Non-Abusive

Another method of obtaining our baseline is a direct approach: simply categorize the emotions into abusive and non-abusive sub-groups, where a random guess would achieve 50%. However, we are also interested in how text representation will affect the classification. Hence, we will generate results by passing the representations through logistic regression.
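A sketch of this binary baseline is shown below: negative-emotion texts are labelled 1 (abusive) and positive ones 0, then a bag-of-words representation is fed to logistic regression. The texts and labels are toy values, not the GoEmotions data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy abusive (1) vs non-abusive (0) samples
texts = ["i hate this so much", "you are disgusting", "what a lovely day",
         "i am so happy for you", "this makes me furious", "great job everyone"]
labels = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)          # text representation
clf = LogisticRegression().fit(X, labels)           # binary classifier
print(clf.score(X, labels))                         # training accuracy on toy data
```

In the real pipeline the score would of course be computed on a held-out test split rather than the training texts.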

Fit_Transform VS Transform

When performing Bag of Words and TF-IDF, we only use fit_transform on the training data and transform on the test data. Because the training and test texts differ (different words, different word counts), fitting on the test data would build a different vocabulary from the one the model was trained on, so the test features would no longer align with the training features. Using transform keeps the test data in the training vocabulary that Logistic Regression expects, with unseen words simply ignored.
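This fit_transform/transform pattern can be sketched on toy texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the game hurt", "great game today"]
test_texts = ["hurt feelings today"]      # "feelings" is unseen in training

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)  # learns the vocabulary from train only
X_test = vec.transform(test_texts)        # reuses that vocabulary; unseen words dropped

print(X_train.shape[1] == X_test.shape[1])  # → True: same feature space
```

Calling fit_transform on the test texts instead would produce a matrix in a different feature space, which the trained classifier could not consume.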

On the dataset considering the representation of emojis

Bag of Words (BoW)

A bag-of-words (BoW) is a representation of text that describes the occurrence of each unique word within a text, without consideration of word order. Hence, all words are treated as independent of each other. Each unique word in the dictionary corresponds to a (descriptive) feature.

We pass in a pre-defined list of stop words from the English language that we do not want in our vocabulary.
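The vectorizer setup might look like the following. The 1500-word cap matches the vocabulary size used in this project; the three-sentence corpus is a toy stand-in.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the game hurt me", "the game was great", "great game great crowd"]

# English stop words removed; vocabulary capped at the 1500 most frequent words
vec = CountVectorizer(stop_words="english", max_features=1500)
X = vec.fit_transform(corpus)
print(sorted(vec.vocabulary_))  # → ['crowd', 'game', 'great', 'hurt']
```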

BoW - Example

Below shows a dataframe containing the top 1500 most frequently used words in the dataset.

Analysis

However, as BoW simply counts the number of words in each text, it gives greater weight to longer texts compared to shorter ones. Hence, we will also take a look at TF-IDF, which accounts for very common words such as articles by down-weighting them.

TF-IDF

Term Frequency–Inverse Document Frequency represents how important each word is to a given document within a set of documents. It is a statistical measure that combines the relative frequency of a word in a document (Term Frequency) with the log of the inverse fraction of documents containing the word (Inverse Document Frequency).

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
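One common formulation of this (the smoothed variant that scikit-learn's TfidfTransformer uses by default) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1
```

where $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $n$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$; the resulting row vectors are then typically L2-normalised.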

Similarly to Bag-of-Words, each word is independent of other words. However, unlike the BoW model, which contains only the counts of word occurrences in a document, the TF-IDF model also carries information about how important or unimportant each word is.

As we already used Count Vectorizer above in BoW, we can simply apply TFIDF Transformer to the vectorized data. TFIDF Vectorizer is not used here, as it does what Count Vectorizer and TFIDF Transformer do at once.
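The two-step version can be sketched as follows on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the game hurt", "the game was great", "great great great"]

counts = CountVectorizer().fit_transform(corpus)   # BoW counts, as above
tfidf = TfidfTransformer().fit_transform(counts)   # re-weight the counts by IDF

# TfidfVectorizer would combine both steps, but since the count matrix
# already exists, only the transformer is needed here.
print(tfidf.shape == counts.shape)  # → True: same documents, same vocabulary
```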

TF-IDF - Example

Below shows a dataframe containing the top 1500 most frequently used words in the dataset.

Analysis - TFIDF

Neither TF-IDF nor BoW takes the sequence of words into consideration, hence we will take a look at word embeddings.

Word Embeddings

Word Embeddings are a method of extracting features from text so that we can feed those features into a machine learning model that works with text data.

This targets the disadvantage of the baselines above, which do not take the word sequence in the text into account.
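One simple way to turn embeddings into a fixed-length text feature is mean pooling over word vectors, sketched below. The two-dimensional hand-coded vectors stand in for real pretrained embeddings (e.g. word2vec or GloVe), which would be loaded from a trained model.

```python
import numpy as np

# Toy embedding lookup standing in for pretrained word vectors
emb = {
    "game": np.array([0.2, 0.1]),
    "hurt": np.array([-0.4, 0.3]),
}

def text_vector(text: str) -> np.ndarray:
    """Represent a text as the mean of its word vectors (unknown words skipped)."""
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

print(text_vector("game hurt"))  # → [-0.1  0.2]
```

Note that plain mean pooling still discards word order; it is the downstream sequence models (explored with deep learning later) that exploit the ordering of the embedded words.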

On the dataset NOT considering the representation of emojis

Bag-of-Words

TF-IDF

Word Embeddings

Comparing the evaluation metrics generated using these text representations on the dataframes with emoji (df_emoji) and without emoji (df_without_emoji), we can see that the results on the dataset with emoji representation are slightly better.

Multiclass Classification - OvR

As seen from the original dataset below, it is multi-label, as each sample text may have one or more of the 6 defined labels. Hence, we can use the OneVsRest (OvR) classifier, which is commonly used for both multi-class and multi-label classification.

OvR represents each class with one classifier by splitting the problem into multiple binary classifications, for example admiration VS not-admiration, followed by amusement VS not-amusement. This allows us to obtain a probability for each label.
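A sketch of the OvR setup is shown below, with one logistic-regression classifier per emotion label. The four texts and the three-label indicator matrix are toy values, not project data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

texts = ["i am so angry", "this is scary", "what a happy day", "angry and scared"]
Y = np.array([[1, 0, 0],   # anger
              [0, 1, 0],   # fear
              [0, 0, 1],   # joy
              [1, 1, 0]])  # anger + fear (multi-label row)

X = TfidfVectorizer().fit_transform(texts)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

probs = ovr.predict_proba(X)   # one probability per label per text
print(probs.shape)             # → (4, 3)
```

Ranking each row of `probs` is what yields the "top three predicted labels" analysis below.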

The top three labels for "game hurt" are relief, grief and sadness which are 0, 4 and 5 respectively. From the dataset above, the cleaned text "game hurt" has one label of sadness which belongs to the top three predicted labels.

MultiLabel Classification

"Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document." Meaning, each sample can belong to zero, one or more labels. In this case, each text can contain none, one or more than one emotion categories.

Binary Relevance is one of the most basic approaches to multi-label classification where each label is treated as a separate single class classification problem, the prediction output is the union of all per label classifiers.

From the bar chart above, we can see that although the majority of texts have only one emotion label, there are other text samples with more than one emotion label, or even none. Hence we will be taking a look at multi-label classification.

Binary Relevance

Multioutput Classifier
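The binary-relevance approach described above can be sketched with scikit-learn's `MultiOutputClassifier`, which fits one independent binary classifier per label column. The texts and label matrix are the same kind of toy values as before.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

texts = ["i am so angry", "this is scary", "what a happy day", "angry and scared"]
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])  # toy label matrix

X = TfidfVectorizer().fit_transform(texts)

# One independent binary classifier per label column; the overall
# prediction is the union of the per-label outputs.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X).shape)  # → (4, 3): one 0/1 decision per label per text
```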

Baseline Analysis

From our models above, we can see that TF-IDF is the most effective text representation. Hence, the machine learning algorithms will use TF-IDF as input to perform classification. Among them, the Multioutput Classifier generated the highest balanced accuracy of 71%, which will serve as our second baseline to improve upon using deep learning.

Hence, we will turn to deep learning, exploring different text representations together with different deep learning models to generate better accuracy.